Parsing with subdomain instance weighting from raw corpora

Authors

  • Barbara Plank
  • Khalil Sima'an
Abstract

The treebanks that are used for training statistical parsers consist of hand-parsed sentences from a single source/domain such as newspaper text. However, newspaper text concerns different subdomains of language use (e.g. finance, sports, politics, music), which implies that the statistics gathered by generative statistical parsers are averages over subdomain statistics. In this paper we explore a method, subdomain instance-weighting, that exploits raw subdomain corpora for introducing subdomain statistics into a state-of-the-art generative parser. We employ instance-weighting for creating an ensemble of subdomain-specific versions of the parser, and explore methods for amalgamating their predictions. Our experiments show that subdomain statistics extracted from raw corpora can even improve the quality of the n-best lists of a formidable, state-of-the-art parser.
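
As a rough illustration of what instance-weighting can mean for a generative parser, the sketch below estimates grammar rule probabilities by weighted relative frequency, so that sentences with a higher weight contribute more to the counts. The abstract does not specify the weighting scheme or the parser internals, so the per-sentence weights, the rule representation, and the function name `weighted_rule_probs` are all assumptions made for this example and not the authors' implementation.

```python
from collections import defaultdict


def weighted_rule_probs(treebank):
    """Estimate P(rhs | lhs) by weighted relative frequency.

    `treebank` is a list of (rules, weight) pairs, where `rules` is the list
    of (lhs, rhs) productions used in one parsed sentence and `weight` is a
    hypothetical per-sentence subdomain weight (an assumption of this sketch).
    """
    rule_counts = defaultdict(float)
    lhs_counts = defaultdict(float)
    for rules, weight in treebank:
        for lhs, rhs in rules:
            # Each production counts `weight` times instead of once.
            rule_counts[(lhs, rhs)] += weight
            lhs_counts[lhs] += weight
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}


# Toy treebank with made-up weights: the first sentence is assumed to be
# twice as relevant to the target subdomain as the second.
toy_treebank = [
    ([("NP", ("DT", "NN")), ("VP", ("VBD", "NP"))], 2.0),
    ([("NP", ("NNP",)), ("VP", ("VBD", "NP"))], 1.0),
]

probs = weighted_rule_probs(toy_treebank)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.3f}")
```

Running the same estimator once per subdomain, each time with a different set of weights, would give one subdomain-specific model per run; how such models are combined into an ensemble is not described in the abstract and is left out of the sketch.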

Similar resources

Subdomain Sensitive Statistical Parsing using Raw Corpora

Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences addressing different subdomains (e.g. sports, politics, music), which implies that the statistics gathered by current statistical parsers are mixtures of subdomains of language use. In this paper we present a method that exploits raw subdomain corpora gathered from the web...

Sentence-Level Instance-Weighting for Graph-Based and Transition-Based Dependency Parsing

Instance-weighting has been shown to be effective in statistical machine translation (Foster et al., 2010), as well as cross-language adaptation of dependency parsers (Søgaard, 2011). This paper presents new methods to do instance-weighting in state-of-the-art dependency parsers. The methods are evaluated on Danish and English data with consistent improvements over unadapted baselines.

Robustness beyond shallowness: incremental deep parsing

Robustness is a key issue for natural language processing in general and parsing in particular, and many approaches have been explored in the last decade for the design of robust parsing systems. Among those approaches is shallow or partial parsing, which produces minimal and incomplete syntactic structures, often in an incremental way. We argue that with a systematic incremental methodology on...

A Framework for Compiling High Quality Knowledge Resources From Raw Corpora

The identification of various types of relations is a necessary step to allow computers to understand natural language text. In particular, the clarification of relations between predicates and their arguments is essential because predicate-argument structures convey most of the information in natural languages. To precisely capture these relations, wide-coverage knowledge resources are indispe...

The Effect of Morphology on Dependency Parsing of Persian

Data-driven systems can be adapted easily to different languages and domains. This trend has led to the introduction of data-driven approaches in dependency parsing. The only prerequisite for data-driven approaches is the existence of appropriate corpora that contain sentences and their associated dependency trees. Despite obtaining highly accurate results for the dependency parsing task in English langu...

Publication date: 2008